When working with text, it is important to be aware of the original document structure, to see what certain parts mean and their relevance to the document’s content.
To process the document, we first read it using Tika, and then we remove various unimportant parts of the text using regular expressions.
First we read in the telegramgate PDF document (which you can download here) using Tika, and we preview the first 200 characters.
We notice that the document has a lot of new lines, so we remove them using the string method
.strip(), and now the first 399 characters of
pdf_content look like this:
[Notice that the line ‘lComo se col6 eso?’ was misread from ‘¿Como se coló eso?’. Tika read some upside down punctuation marks and accented letters incorrectly, for example, it read ó as 6 and í as f. We do not propose a fix for this in this article.]
The date “1/20/2019” and the text “Telegram Web” appear at the beginning of every page, which most likely implies that the Telegram conversation was downloaded on that date, January 20 of 2019. At the end of every page there is a telegram URL and the page you are on out of all pages in the PDF. We can use the page number for keeping track of the page in the PDF for which a message appears, so we should keep that. But the date and text “Telegram Web” at the top and the link at the bottom of every page should be removed, it is not useful to the analysis.
A good place to test your regular expressions is at https://regex101.com/.
There are various time stamps which do not align with any messages.
We will assume that the time in which a message was sent is unimportant to the analysis, and thus will be removed.
pdf_content, you may notice that there are a lot of two letter acronyms. From viewing the PDF, you can tell that a lot of these 2-letter acronyms correspond to the chat members which do not have profile pictures. These acronyms are not useful to the analysis of the document, but we should not attempt to remove them until we know who are the members who participated in the chat. We will remove them in the next section titled “Organize Content”, where we determine who are the admin chat members.
After these steps we could not identify any more strings to remove from the plain text, so we create a list from the
pdf_content by separating the string by the new lines, and we start extracting information.
Now we are ready to organize the data.
In order to organize/structure the information from the PDF, we decide to identify the chat members and which messages were sent per member.
We could use text formatting to automatically identify all members present in the chat, but we were not able to read in the PDF document with formatting included (we read it in using the Tika CLI and the tags xml and html but neither kept the original formatting of the PDF).
So we will have to suffice with identifying the admins automatically and identifying the other members manually (feel free to leave a comment on ways we could automatically retrieve non-admin chat members from the PDF).
When reviewing the PDF’s plain text, we noticed that the lines containing an admin user’s name contain the term ‘admin’. We isolate the lines which contain this term, in order to identify the admin users in the chat.
Now we clean these results by removing the text ‘ admin’ and the text that follows it.
Some admin users are repeated, differing only by the text ‘via@gif’. We remove the elements which contain this text, since those elements are redundant.
When examining the resulting list, you can tell that there are two elements which are most likely typos (user names misread by Tika). These elements are ‘Fdo’ and ‘R Russello’. We remove these elements and have our final list of admin chat members.
The misread usernames are still present in the document. We replace them with the correct spelling.
There is a total of 8 admin members in the chat group. The admin members are the following:
According to the first line of the PDF, which says ‘WRF 12 members’, there is a total of 12 members in the chat group. This means that there are 4 non-admin chat members in the group. After going through the PDF, we found that the remaining 4 chat members are:
Unfortunately we were not able to extract this information within the Python script. We will be adding these usernames together with the admin usernames into a consolidated list of total chat members.
It’s not cheating …
Now that we know who the chat members are, we can remove the 2-letter acronyms associated to some of the chat members who don’t have profile pictures. We build all potential acronyms for the users.
Now that we have the potential acronyms, we search which elements in
pdf_lines contain only one of these acronyms and we remove those lines.
Now to the meat!
We create a list of dictionaries (which we’ll call
conversation)in which each dictionary element has the following keys:
message: Message sent in chat.
chat_member: Chat member who sent the message.
date: Date (weekday, month, day, year) in which the message was sent.
page_number: Page within the PDF in which the message appears.
We use the lines containing the chat member usernames as indicators of who sent which message. These lines are not be included within the final list, since they are not part of the conversation.
With the data now cleaned up and organized, we can create some visualizations to showcase data insights.
We decided to create a horizontal bar chart where each bar corresponds to a chat group member and the length of the bar represents the number of messages sent by a chat member.
We chose to color each bar differently, so we randomly generate colors and assign a different color to each chat member.
If you are running this code and you do not like the generated colors, feel free to rerun the code and create different colors, or manually define the colors you want per user.
We get the number of messages per chat member and store them into a dictionary where the keys are the usernames and the values are the message counts.
We used matplotlib to draw the bar chart.
The following is the resulting graph:
From the diagram we can see that Edwin Miranda sent the most messages. This may be because he shared many news articles and commented on them in the chat group, as can be seen from the PDF.
We decided to create an abridged telegramgate PDF document, to see how much shorter the document can be made and if the chat’s content can be shown in a clearer, more straightforward format. The resulting PDF would not include images or GIFs, and it would not differentiate shared posts from written posts.
In order to create the PDF, we first create a string in HTML format using the content in the variable
conversation. We write the page number, the username in the colors generated for the visualizations in the previous section, and the message sent by the user. We did not write the date the message was sent. After creating the HTML string, we save it into a file on our local folder, which we name “telegramgate_abridged.html”. Then we use the PDFkit Python library to create the PDF file from the HTML file, and we save the result to “telegramgate_abridged.pdf”.
The resulting PDF is 654 pages long, while the original document is 899 pages. This means the new PDF has 245 fewer pages than the original.
We chose to color-code the usernames in order to easily differentiate who wrote the messages. When the usernames are of the same length and color, they can visually blend together. If you wish to create the PDF without colored usernames, you can do so by removing the
font tags added within the for loop.
You can access the PDF generated from this script here.